Our main goal: to figure out which accounts the participants are following
Problem: the reverse-chronological timeline endpoint is capped (limited to 10k tweets/month)
To find out: how much we can spend, and how many tweets we need per participant
To get a sense of this, we analyze the Rockwell pilot data and try to estimate…
Strategy:
X: number of tweets collected
Y: fraction of distinct accounts appearing in the tweets pulled so far
See where the plot flattens (find the elbow/saturation point)
Two versions of the y-axis are used:
old y: # of distinct accounts appearing in tweets / total # of distinct friends that each user has
new y: # of distinct accounts appearing in tweets / total # of distinct accounts that each user sees
The x-axis is the # of tweets collected in all plots. However, this x-axis is not comparable across users, since they have different numbers of friends: users who follow many accounts naturally collect more tweets, while users who follow only a few accounts collect fewer. Since the y-axis is already in relative terms, a rescaled x-axis is added as well: the # of tweets collected divided by the average tweets per second of each user. To ease comparison, the old-version and new-version plots are drawn side by side. See the notes on each plot.
library(readr)
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(grid)
library(gridExtra)
library(DT)
library(lubridate)
library(scales)
# Load data
# The data is cleaned and exported using Python.
# Python code for making dataframe for R is provided upon request.
df <- read_csv("df.csv")
# Define data type
df %>%
mutate(
user_id = as.factor(user_id),
tweet_id = as.factor(tweet_id),
friend_id = as.factor(account_id)
) %>%
dplyr::select(-account_id) -> df
# Making new variables
# (1) Maximum number of each user's friends: max_friends_count
# What is the maximum of user_friends_count?
df %>%
dplyr::select(user_id, user_friends_count) %>%
distinct() %>%
group_by(user_id) %>%
mutate(max_friends_count = max(user_friends_count)) %>%
dplyr::select(-user_friends_count) %>%
distinct() -> max_data
# max_data: a new dataframe with `user_id` and `max_friends_count` as variables
# Merge this 'max_data' into df
df %>%
merge(max_data, by="user_id") -> df
# (2) Timestamp data of each tweet: tweet_timestamp
# Let's clean the timestamp data to play well with the R lubridate package
list_timestamp <- str_split(df$tweet_timestamp, " ") # make a list containing each string component of timestamp data
Month = c() # make an empty vector
Day = c()
Time = c()
Year = c()
timestamp_dmyt = c()
# fill these vectors with month, day, time, year components within each list element
for (i in 1:length(list_timestamp)) {
list_timestamp[[i]][2] -> Month[i]
list_timestamp[[i]][3] -> Day[i]
list_timestamp[[i]][4] -> Time[i]
list_timestamp[[i]][6] -> Year[i]
}
timestamp_dmyt = as.data.frame(cbind(Day, Month, Year, Time)) # bind these filled vectors and make it as a dataframe; store this dataframe as 'timestamp_dmyt'
# now paste the strings into one and store them in a vector 'dmyt'
dmyt = c()
for (i in 1:nrow(timestamp_dmyt)) {
dmyt[i] = paste(Day[i], Month[i], Year[i], Time[i])
}
# make 'dmyt' vector as a new variable of dataframe: 'tweet_timestamp'
df$tweet_timestamp = dmy_hms(dmyt) # timestamp format: day-month-year-hour-minute-second
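The loop above works, but the same result can be obtained with one vectorized call, assuming the raw strings are in Twitter's classic `Wed Oct 10 20:19:24 +0000 2018` format (an assumption about this data; adjust the format string if the raw export differs):

```r
# Vectorized alternative to the loop above (a sketch; assumes Twitter's
# classic "Wed Oct 10 20:19:24 +0000 2018" format and an English locale
# so that %a/%b match the day and month abbreviations).
parse_tweet_timestamp <- function(x) {
  as.POSIXct(x, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
}

parse_tweet_timestamp("Wed Oct 10 20:19:24 +0000 2018")
```

This also avoids growing vectors element by element inside a for loop, which gets slow for large pulls.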
# (3) Define x-axis: number of tweets collected
df %>%
arrange(tweet_timestamp) %>% # arrange the data by time
group_by(user_id) %>%
count(tweet_id) %>%
mutate(
old_x = cumsum(n), # old_x: number of tweets collected so far
max_n_tweets = max(old_x)
) %>%
dplyr::select(
user_id, tweet_id, old_x, max_n_tweets
) -> df_for_x
df %>%
inner_join(df_for_x, by=c("user_id", "tweet_id")) %>%
arrange(user_id, tweet_timestamp) -> df
#* [X] Re-scaling → divide x axis by the average tweets per second of each participant.
#* For each participant, (1) take the first and last tweet in the data and compute the number of seconds between them, and then (2) divide the total number of tweets seen for the participant by the number of seconds.
df |>
group_by(user_id) |>
summarise(timediff = as.numeric(difftime(max(tweet_timestamp), min(tweet_timestamp), units = "secs"))) -> timeDiff # force units to seconds; difftime's default unit varies with the size of the gap
df |>
merge(timeDiff, by="user_id") |>
group_by(user_id) |>
mutate(
avg_n_tweets_persec = max_n_tweets / as.numeric(timediff)
) |>
ungroup() |>
mutate(
new_x = old_x / avg_n_tweets_persec # new_x: number of tweets collected so far divided by the average number of tweets per second
) -> df2
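On hypothetical numbers, the rescaling works like this: a user whose 100 collected tweets span 50,000 seconds averages 0.002 tweets per second, so their 40th tweet maps to new_x = 40 / 0.002 = 20,000, roughly the number of timeline-seconds needed to accumulate that many tweets:

```r
# Toy illustration of the x-axis rescaling (hypothetical numbers, not
# taken from the pilot data).
max_n_tweets  <- 100    # total tweets collected for this user
timediff_secs <- 50000  # seconds between the user's first and last tweet
avg_n_tweets_persec <- max_n_tweets / timediff_secs  # 0.002 tweets/sec
old_x <- 40             # the 40th tweet collected
new_x <- old_x / avg_n_tweets_persec
new_x  # 20000 timeline-seconds
```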
# (4) Define y-axis: count how many distinct accounts appear in the tweets (numerator)
#* [Y] Re-scaling → express the y-axis as a fraction (new denominator, as Brendan suggested: the maximum number of distinct accounts each user "sees" (not has), so that all 60 individuals reach 1 at the end in the individual plots)
df2 %>%
arrange(user_id, tweet_timestamp) %>%
group_by(user_id) %>%
mutate(
numerator = cumsum(!duplicated(friend_id)),
old_y = numerator / max_friends_count,
new_y = numerator / max(numerator)
) -> df2
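The running distinct-account count relies on the `cumsum(!duplicated(...))` idiom; a toy example with hypothetical account IDs shows what it produces:

```r
# !duplicated() is TRUE the first time an ID appears and FALSE afterwards,
# so the cumulative sum gives the number of distinct IDs seen so far.
accounts <- c("a", "b", "a", "c", "b", "d")  # hypothetical friend IDs
cumsum(!duplicated(accounts))  # 1 2 2 3 3 4
```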
# df2 is the final data for drawing plots
| Variable | Definition |
|---|---|
| old_x | Number of tweets collected so far |
| new_x | Number of tweets collected so far / Average number of tweets per second |
| old_y | Fraction of distinct accounts appearing in tweets, out of the maximum number of friends each user has (= how many of all the friends each user has have appeared in the tweets pulled so far?) |
| new_y | Fraction of distinct accounts appearing in tweets, out of the maximum number of distinct accounts each user sees in the tweets pulled so far (thus everyone reaches 1 at the end) |
Plots 1a-1d are scatter plots displaying patterns of change in the fraction of distinct accounts as we pull tweets.
df2 %>%
group_by(user_id) %>%
ggplot(aes(x=old_x, y=old_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("Number of Tweets Collected") +
ylab("Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 120000),
labels = label_number(scale_cut = cut_short_scale())) +
ggtitle("Plot 1a", subtitle = "old x (# of tweets), \nold y (# distinct accounts/max friends count)") -> plot1a
df2 %>%
group_by(user_id) %>%
ggplot(aes(x=old_x, y=new_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 120000),
labels = label_number(scale_cut = cut_short_scale())) +
ggtitle("Plot 1b", subtitle = "old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot1b
df2 %>%
group_by(user_id) %>%
ggplot(aes(x=new_x, y=old_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected / Avg # of Tweets per sec") +
ylab("Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits = c(0, 30000000),
labels = label_number(scale_cut = cut_short_scale())) +
ggtitle("Plot 1c", subtitle = "new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot1c
df2 %>%
group_by(user_id) %>%
ggplot(aes(x=new_x, y=new_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected / Avg # of Tweets per sec") +
ylab("Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits = c(0, 30000000),
labels = label_number(scale_cut = cut_short_scale())) +
ggtitle("Plot 1d", subtitle = "new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot1d
grid.arrange(plot1a, plot1b, plot1c, plot1d, nrow=2)
What happens when we set the x-axis (and y-axis) to common logarithmic scales?
plot1a +
xlab("Log(Number of Tweets Collected)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 2a", subtitle = "Logged old x (# of tweets), \nold y (# distinct accounts/max friends count)") -> plot2a
plot1b +
xlab("Log(Number of Tweets Collected)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 2b", subtitle = "Logged old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot2b
plot1c +
xlab("Log(# of Tweets Collected / Avg # of Tweets per sec)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 2c", subtitle = "Logged new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot2c
plot1d +
xlab("Log(# of Tweets Collected / Avg # of Tweets per sec)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 2d", subtitle = "Logged new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot2d
plot2c +
scale_y_log10(n.breaks=10, labels = scales::label_log()) +
ggtitle("Plot 2e", subtitle = "Logged new x (rescaled # of tweets), \nLogged old y (# distinct accounts/max friends count)") -> plot2e
plot2d +
scale_y_log10(n.breaks=10, labels = scales::label_log()) +
ggtitle("Plot 2f", subtitle = "Logged new x (rescaled # of tweets), \nLogged new y (# distinct accounts/max distinct accounts seen)") -> plot2f
grid.arrange(plot2a, plot2b, plot2c, plot2d, plot2e, plot2f, ncol=2)
What if we aggregate users by taking the mean fraction of distinct accounts (=y) at each point of the tweets collected (=x)?
For plots with the rescaled x-axis (= # of tweets / avg # of tweets per sec), I binned the x-axis and then calculated a weighted average of y. This is because the avg # of tweets per sec varies widely across users, which makes it nearly useless to group by each individual point on the x-axis and summarize the mean of y. Because the same bin may contain multiple observations of the same user, and some users may not appear at all in certain bins, I apply a weighted average in which each user is weighted by their number of observations in each bin.
# old_x, old_y
df2 %>%
group_by(old_x) %>%
summarize(y = mean(old_y)) %>%
ungroup() %>%
ggplot(aes(x=old_x, y=y)) +
geom_point(alpha=0.5) +
geom_smooth(color='darkcyan', linewidth=1) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits = c(0, 120000),
labels = label_number(scale_cut = cut_short_scale())) +
ylim(c(0, 1)) +
ggtitle("Plot 3a", subtitle = "old x (# of tweets),\nold y (# distinct accounts/max friends count)") -> plot3a
# old_x, new_y
df2 %>%
group_by(old_x) %>%
summarize(y = mean(new_y)) %>%
ungroup() %>%
ggplot(aes(x=old_x, y=y)) +
geom_point(alpha=0.5) +
geom_smooth(color='darkcyan', linewidth=1) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits = c(0, 120000),
labels = label_number(scale_cut = cut_short_scale())) +
ylim(c(0, 1)) +
ggtitle("Plot 3b", subtitle = "old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot3b
# make a new dataframe with binned x-axis and weighted mean
df2 %>%
mutate(
bins = cut(new_x ,
breaks = pretty(new_x, n = (max(new_x)-min(new_x))/100000), # 1057 levels
include.lowest = TRUE)) %>%
group_by(user_id, bins) %>%
mutate(weights = n()) %>%
ungroup() %>%
group_by(bins) %>%
summarise(old_y_weighted = weighted.mean(old_y, weights),
new_y_weighted = weighted.mean(new_y, weights)) %>%
ungroup() -> df3
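A small hypothetical example mirrors the weighting above: each row carries its user's per-bin observation count (the `mutate(weights = n())` step), and `weighted.mean()` then summarizes the bin:

```r
# Toy version of the per-bin weighted mean (hypothetical values).
# Within one bin: user u1 contributes 2 observations, user u2 contributes 1,
# so u1's rows each get weight 2 and u2's row gets weight 1.
y <- c(0.2, 0.4, 0.9)   # old_y or new_y values for the rows in the bin
w <- c(2, 2, 1)         # weights = n() per (user, bin)
weighted.mean(y, w)     # (0.2*2 + 0.4*2 + 0.9*1) / 5 = 0.42
```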
# new_x, old_y
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
ggplot(aes(x=bins_x, y=old_y_weighted)) +
geom_point(alpha=0.5) +
theme_few() +
xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 1000),
labels = label_number()) +
ylim(c(0, 1)) +
ggtitle("Plot 3c", subtitle = "Binned new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot3c
# new_x, new_y
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
ggplot(aes(x=bins_x, y=new_y_weighted)) +
geom_point(alpha=0.5) +
theme_few() +
xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 1000),
labels = label_number()) +
ylim(c(0, 1)) +
ggtitle("Plot 3d", subtitle = "Binned new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot3d
# Let's zoom in plot3c and plot3d:
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
filter(bins_x < 160) %>%
ggplot(aes(x=bins_x, y=old_y_weighted)) +
geom_point(alpha=0.5) +
theme_few() +
xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 160),
labels = label_number()) +
ylim(c(0, 1)) +
ggtitle("Plot 3c | Zoomed In", subtitle = "Binned new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") +
geom_vline(xintercept=153, lty=2, color="darkcyan") +
geom_vline(xintercept=74, lty=2, color="darkcyan") -> plot3c_zoom
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
filter(bins_x < 160) %>%
ggplot(aes(x=bins_x, y=new_y_weighted)) +
geom_point(alpha=0.5) +
theme_few() +
xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 160),
labels = label_number()) +
ylim(c(0, 1)) +
ggtitle("Plot 3d | Zoomed In", subtitle = "Binned new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") +
geom_vline(xintercept=153, lty=2, color="darkcyan") +
geom_vline(xintercept=74, lty=2, color="darkcyan") -> plot3d_zoom
grid.arrange(plot3a, plot3b, plot3c, plot3d, plot3c_zoom, plot3d_zoom, ncol=2)
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
dplyr::select(bins_x, bins) %>%
unique() -> table_bin
datatable(table_bin,
caption = "Bin No. & Bin Range",
filter="top")
Or this version of aggregate plots?
df2 %>%
group_by(old_x) %>%
summarize(y_old = mean(old_y), y_new = mean(new_y)) %>%
ungroup() %>%
pivot_longer(cols = c("y_old","y_new")) %>%
ggplot(aes(x=old_x, y=value, col=name)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits = c(0, 120000),
labels = label_number(scale_cut = cut_short_scale())) +
ylim(c(0, 1)) +
ggtitle("Plot 3ab: old x (# of tweets)", subtitle = "Pink: mean of new y \nBlue: mean of old y") -> plot3ab
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
pivot_longer(cols=c("old_y_weighted", "new_y_weighted")) %>%
ggplot(aes(x=bins_x, y=value, col=name)) +
geom_jitter(alpha=0.7, width=0.5, height=0.005) +
theme_few() +
theme(legend.position="none") +
xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 1000),
labels = label_number()) +
ylim(c(0, 1)) +
ggtitle("Plot 3cd: Binned new x (rescaled # of tweets)", subtitle = "Pink: weighted mean of new y \nBlue: weighted mean of old y") +
geom_vline(xintercept=153, lty=2, color="darkcyan") +
geom_vline(xintercept=74, lty=2, color="darkcyan") -> plot3cd
grid.arrange(plot3ab, plot3cd)
What happens when we set the x-axis (and y-axis) to common logarithmic scales and replicate plots 3a~3d?
plot3a +
xlab("Log(Number of Tweets Collected)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 4a", subtitle = "Logged old x (# of tweets), \nold y (# distinct accounts/max friends count)") -> plot4a
plot3b +
xlab("Log(Number of Tweets Collected)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 4b", subtitle = "Logged old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot4b
plot3c +
xlab("Log( Bins of [# of Tweets Collected / Avg # of Tweets per sec] )") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 4c", subtitle = "Logged & Binned new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot4c
plot3d +
xlab("Log( Bins of [# of Tweets Collected / Avg # of Tweets per sec] )") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 4d", subtitle = "Logged & Binned new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot4d
plot4c +
scale_y_log10(n.breaks=10, labels = scales::label_log()) +
ggtitle("Plot 4e", subtitle = "Logged & Binned new x (rescaled # of tweets), \nLogged old y (# distinct accounts/max friends count)") -> plot4e
plot4d +
scale_y_log10(n.breaks=10, labels = scales::label_log()) +
ggtitle("Plot 4f", subtitle = "Logged & Binned new x (rescaled # of tweets), \nLogged new y (# distinct accounts/max distinct accounts seen)") -> plot4f
grid.arrange(plot4a, plot4b, plot4c, plot4d, plot4e, plot4f, ncol=2)
Or… like this?
plot3ab +
xlab("Log(# of Tweets Collected)") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 4ab: Logged old x (# of tweets)", subtitle = "Pink: mean of new y \nBlue: mean of old y") -> plot4ab
df3 %>%
mutate(bins_x = as.integer(bins)) %>%
pivot_longer(cols=c("old_y_weighted", "new_y_weighted")) %>%
ggplot(aes(x=bins_x, y=value, col=name)) +
geom_jitter(alpha=0.7, height=0.005) +
theme_few() +
theme(legend.position="none") +
ylab("Mean Fraction of Distinct Accounts (%)") +
ylim(c(0, 1)) +
xlab("Log(Bins of [# of Tweets Collected / Avg # of Tweets per sec])") +
scale_x_log10(n.breaks=10,
labels = scales::label_log()) +
ggtitle("Plot 4cd: Logged & Binned new x (rescaled # of tweets)", subtitle = "Pink: weighted mean of new y \nBlue: weighted mean of old y") -> plot4cd
plot4ab +
ylab("Log(Mean Fraction of Distinct Accounts (%))") +
scale_y_log10(n.breaks=10, labels = scales::label_log()) +
ggtitle("Plot 4ab_2: Logged old x (# of tweets)", subtitle = "Pink: Logged weighted mean of new y \nBlue: Logged weighted mean of old y") -> plot4ab_2
plot4cd +
ylab("Log(Mean Fraction of Distinct Accounts (%))") +
scale_y_log10(n.breaks=10, labels = scales::label_log()) +
ggtitle("Plot 4ef: Logged & Binned new x (rescaled # of tweets)", subtitle = "Pink: Logged weighted mean of new y \nBlue: Logged weighted mean of old y") -> plot4ef
grid.arrange(plot4ab, plot4cd, plot4ab_2, plot4ef)
In the data frame, there are 60 unique users. Let’s redraw some of the plots by each individual user. I allowed scales of the x-axis to vary for each user.
df2 %>%
ggplot(aes(x=old_x, y=old_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
scale_x_continuous(n.breaks = 5,
labels = label_number(scale_cut = cut_short_scale())) +
facet_wrap(~user_id, scales="free_x", ncol = 10) +
ggtitle("Plot 5a", subtitle = "old x (# of tweets), \nold y (# distinct accounts/max friends count)")
df2 %>%
ggplot(aes(x=old_x, y=new_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
scale_x_continuous(n.breaks = 5,
labels = label_number(scale_cut = cut_short_scale())) +
facet_wrap(~user_id, scales="free_x", ncol = 10) +
ggtitle("Plot 5b", subtitle = "old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)")
df2 %>%
ggplot(aes(x=new_x, y=old_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected / Avg # of Tweets per sec") +
ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
scale_x_continuous(n.breaks = 5,
labels = label_number(scale_cut = cut_short_scale())) +
facet_wrap(~user_id, scales="free_x", ncol = 10) +
ggtitle("Plot 5c", subtitle = "new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)")
df2 %>%
ggplot(aes(x=new_x, y=new_y, col=user_id)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected / Avg # of Tweets per sec") +
ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
scale_x_continuous(n.breaks = 5,
labels = label_number(scale_cut = cut_short_scale())) +
facet_wrap(~user_id, scales="free_x", ncol = 10) +
ggtitle("Plot 5d", subtitle = "new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)")
It seems some people follow very few accounts while others follow a great many. Let’s check the distribution of friends counts, as well as the maximum number of accounts observed in the collected tweets.
df2 %>%
group_by(user_id) %>%
mutate(max_accounts_seen = max(numerator)) %>%
distinct(user_id, max_friends_count, max_accounts_seen) %>%
arrange(max_friends_count) -> table_dta # ascending by friends count
datatable(table_dta, filter="top")
Let’s remove users whose max_accounts_seen is less than 10 or more than 1,000, and re-draw the aggregate plots.
df2 %>%
group_by(user_id) %>%
mutate(max_accounts_seen = max(numerator)) %>%
filter(max_accounts_seen >= 10 & max_accounts_seen <= 1000) %>%
ungroup() %>%
group_by(old_x) %>%
summarize(y_old = mean(old_y), y_new = mean(new_y)) %>%
ungroup() %>%
pivot_longer(cols = c("y_old","y_new")) %>%
ggplot(aes(x=old_x, y=value, col=name)) +
geom_point(alpha=0.5) +
theme_few() +
theme(legend.position="none") +
xlab("# of Tweets Collected") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits = c(0, 45000),
labels = label_number(scale_cut = cut_short_scale())) +
ggtitle("Plot 6ab (w/o outliers): old x (# of tweets)", subtitle = "Pink: mean of new y \nBlue: mean of old y") -> plot6ab
df2 %>%
mutate(
bins = cut(new_x ,
breaks = pretty(new_x, n = (max(new_x)-min(new_x))/100000), # 1057 levels
include.lowest = TRUE)) %>%
group_by(user_id, bins) %>%
mutate(weights = n()) %>%
ungroup() %>%
group_by(user_id) %>%
mutate(max_accounts_seen = max(numerator)) %>%
filter(max_accounts_seen >= 10 & max_accounts_seen <= 1000) %>%
ungroup() %>%
group_by(bins) %>%
summarise(old_y_weighted = weighted.mean(old_y, weights),
new_y_weighted = weighted.mean(new_y, weights)) %>%
ungroup() -> df4
df4 %>%
mutate(bins_x = as.integer(bins)) %>%
pivot_longer(cols=c("old_y_weighted", "new_y_weighted")) %>%
ggplot(aes(x=bins_x, y=value, col=name)) +
geom_jitter(alpha=1, width=0.5, height=0.005) +
theme_few() +
theme(legend.position="none") +
xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
ylab("Mean Fraction of Distinct Accounts (%)") +
scale_x_continuous(n.breaks = 10, limits=c(0, 80),
labels = label_number()) +
ggtitle("Plot 6cd (w/o outliers): new x (rescaled # of tweets)", subtitle = "Pink: weighted mean of new y \nBlue: weighted mean of old y") +
geom_vline(xintercept=29, lty=2, color="darkcyan") +
geom_vline(xintercept=77, lty=2, color="darkcyan") -> plot6cd
# bin no. 29: (2,800,000 , 2,900,000]
# bin no. 77: (7,300,000 , 7,400,000]
grid.arrange(plot6ab, plot6cd)
First, I binned the rescaled x-axis into 1,057 levels in total.
Second, for each user, I counted that user's observations within each bin (= weights).
Third, for each bin, I calculated the weighted mean of the y-axis, weighting by the weights from the second step.
Therefore, each bin ends up with a single weighted-mean value of y.
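One simple way to turn "find the elbow/saturation point" into a number (a sketch of one possible heuristic, not a method settled on here) is to take the first x at which the curve reaches a chosen fraction of its final value:

```r
# Saturation-point heuristic (assumes the curve is roughly non-decreasing).
# Returns the first x whose y reaches `frac` of the maximum observed y.
find_saturation_x <- function(x, y, frac = 0.9) {
  stopifnot(length(x) == length(y))
  ord <- order(x)
  x[ord][which(y[ord] >= frac * max(y))[1]]
}

# Toy curve (hypothetical) that saturates around x = 5:
find_saturation_x(1:10, c(0.1, 0.3, 0.55, 0.75, 0.9, 0.95, 0.97, 0.98, 0.99, 1))
```

Applied to df3, x would be the bin index and y the weighted mean; varying `frac` shows how sensitive the cutoff is.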
Which plot should we base our final decision on?
After we narrow down to a few plots… I will redraw them with the distinct # of low-quality sources on the y-axis!